# NYC Real Estate Sales (2016–2025, cleaned DuckDB)

## Overview

This dataset provides cleaned, row-level records of New York City real property sales from approximately **January 2016 through the latest available 2025 file** (as of October 2025). The data is packaged as a single [DuckDB](https://duckdb.org/) database for easy analysis.

- Sources:
	- NYC Department of Finance (DOF) Rolling Sales – borough files used for the most recent year-to-date (YTD) records
	- NYC Open Data “Citywide Annualized Calendar Sales” – consolidated/historical coverage
- Coverage: 2016–2025 YTD. YTD rows are sourced from the periodically published DOF rolling sales files; the exact end date reflects the latest DOF/citywide release available at build time.
- Update date: 2025-10 (based on the included files)

## Files provided

- `nyc.duckdb` – DuckDB database file with cleaned per-borough sales tables
- `public_README.md` – This documentation file

## Database contents

The database contains one table per borough with a harmonized schema:

- `manhattan`
- `brooklyn`
- `queens`
- `bronx`
- `staten_island`

All borough tables share the same canonical column names and types so they can be `UNION ALL`-ed for citywide analysis.

### Key cleaning and normalization steps

The raw borough and citywide Excel files were imported and standardized using a repeatable process:

1. Header detection and column name normalization (lowercase, snake_case, consistent spellings)
2. Removal of stray/empty Excel columns (e.g., `col_21`, `col_22`, `unnamed_*`)
3. Type enforcement (e.g., dates → DATE, numeric identifiers → BIGINT, ZIP as TEXT to preserve leading zeros)
4. Address and apartment normalization; extraction of standardized `apartment_number` when present in the address string
5. Computation of `address_base` (house number + street, sans unit)
6. Creation of a stable `sale_id` hash for de-duplication
7. Two-stage de-duplication:
	 - Prefer non-zero price for rows that are identical across address/unit/date
	 - Enforce uniqueness by `sale_id`
8. Building entity resolution: create `building_id` with a priority scheme (BIN → GEO → BBL → ADDRESS) and record `building_id_source`; cluster-based imputation upgrades are applied where possible
9. Cross-batch schema alignment so all borough tables retain a consistent set of columns

### Common columns (selected)

- `sale_date` (DATE)
- `sale_price` (BIGINT)
- `borough` (VARCHAR)
- `neighborhood` (VARCHAR)
- `address` (VARCHAR)
- `address_base` (VARCHAR) – normalized base address without unit
- `apartment_number` (VARCHAR) – standardized unit/apt when available
- `zip_code` (VARCHAR)
- `block` (BIGINT) 
- `lot` (BIGINT)
- `building_class_category` (VARCHAR)
- `building_class_at_present` (VARCHAR)
- `building_class_at_time_of_sale` (VARCHAR)
- `residential_units` (BIGINT)
- `commercial_units` (BIGINT)
- `total_units` (BIGINT)
- `land_square_feet` (BIGINT) – when present in source
- `gross_square_feet` (BIGINT) – when present in source
- `year_built` (BIGINT)
- `bin` (VARCHAR)
- `bbl` (VARCHAR)
- `latitude` (DOUBLE) – when present in source
- `longitude` (DOUBLE) – when present in source
- `census_tract_2020` (VARCHAR)
- `nta_2020` (VARCHAR) – when present in source
- `building_id` (VARCHAR)
- `building_id_source` (VARCHAR)
- `sale_id` (VARCHAR) – stable unique identifier used to avoid duplicates

## Example usage (DuckDB)

List tables:

```sql
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'main'
ORDER BY table_name;
```

Citywide union query for 2024 sales:

```sql
WITH all_sales AS (
	SELECT * FROM manhattan
	UNION ALL SELECT * FROM brooklyn
	UNION ALL SELECT * FROM queens
	UNION ALL SELECT * FROM bronx
	UNION ALL SELECT * FROM staten_island
)
SELECT borough, COUNT(*) AS n_sales, SUM(sale_price) AS total_sales
FROM all_sales
WHERE sale_date BETWEEN DATE '2024-01-01' AND DATE '2024-12-31'
GROUP BY 1
ORDER BY 1;
```

Example: sample normalized apartment/unit parsing output:

```sql
SELECT borough, neighborhood, address, apartment_number, sale_price, sale_date
FROM brooklyn
WHERE apartment_number IS NOT NULL
ORDER BY sale_date DESC
LIMIT 20;
```
